System And Method For Supporting Text-To-Speech

ABSTRACT

A system for generating high-quality synthesized text-to-speech includes a learning data generating unit, a frequency data generating unit, and a setting unit. The learning data generating unit recognizes inputted speech, and then generates first learning data in which wordings of phrases are associated with readings thereof. The frequency data generating unit generates, based on the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases. The setting unit sets the thus generated frequency data for a language processing unit in order to approximate outputted speech of text-to-speech to the inputted speech. Furthermore, the language processing unit generates, from a wording of text, a reading corresponding to the wording, on the basis of the appearance frequencies.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from Japanese Patent Application No. 2006-224110, filed on Aug. 21, 2006.

FIELD OF THE INVENTION

This invention relates generally to the field of text-to-speech, and more particularly to a system for improving the accuracy of text-to-speech by causing a language processing unit to learn.

BACKGROUND OF THE INVENTION

In text-to-speech (TTS), in order to output speech which is easily understandable and natural for a listener, it is desirable to accurately determine a way of reading (hereafter, simply called a reading) incorporating not only pronunciations but also accents. In conventional text-to-speech techniques, generation of accents is realized in a manner that numerous rules for determining appropriate accents are found on a trial-and-error basis by analyzing standard speech of an announcer or the like. However, generation of the appropriate rules requires various kinds of work performed by experts, and there has been a risk of requiring enormous cost and time.

There has been proposed a technique for determining a pronunciation and an accent of a phrase in inputted text by using statistical information, instead of rules, such as appearance frequencies of pronunciations and accents of the phrase in previously provided learning data. See Nagano, Mori, and Nishimura, "Kakuritsuteki model wo mochiita yomikata oyobi akusento suitei (Reading and Accent Estimation Using Stochastic Model)," SIG-SLP57 (July, 2005). According to this technique, accurate appearance frequencies can be computed on the premise that a sufficient amount of the learning data is available, and the processing for generating accents can be made more efficient since it is not necessary to generate rules.

However, the abovementioned technique using statistical information requires a large amount of learning data for which accurate pronunciations and accents are provided. In order to generate such learning data, it is required that experts who are conversant with classification of accents and the like manually provide information on accents for each phrase. On the other hand, in sound processing for generating actual speech from information on reading such as pronunciations and accents, data on waveforms of speech actually vocalized by an announcer or the like are often utilized. See Eide, E., et al., "Recent Improvements to the IBM Trainable Speech Synthesis System," Proc. ICASSP 2003, Hong Kong, Vol. 1, pp. 708-711 (April, 2003). For this reason, outputted speech sometimes becomes unnatural because the manually provided accent information is inconsistent with the synthesized speech that utilizes the actual recorded speech.

Consequently, an object of the present invention is to provide a system, a method and a program which are capable of solving the abovementioned problem. This object can be achieved by a combination of the characteristics described in the independent claims in the scope of claims. Additionally, the subordinate claims therein define further advantageous specific examples.

SUMMARY OF THE INVENTION

Briefly stated, a system for generating high-quality synthesized text-to-speech includes a learning data generating unit, a frequency data generating unit, and a setting unit. The learning data generating unit recognizes inputted speech, and then generates first learning data in which wordings of phrases are associated with readings thereof. The frequency data generating unit generates, based on the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases. The setting unit sets the thus generated frequency data for a language processing unit in order to approximate outputted speech of text-to-speech to the inputted speech. Furthermore, the language processing unit generates, from a wording of text, a reading corresponding to the wording, on the basis of the appearance frequencies.

According to an embodiment of the invention, a system for supporting text-to-speech includes a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof; a frequency data generating unit which generates, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; a language processing unit; and a setting unit which sets frequency data in the language processing unit for generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording, in order to approximate outputted speech of text-to-speech to the inputted speech.

According to an embodiment of the invention, a system for supporting text-to-speech includes a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof; a language processing unit; and a learning unit which causes the language processing unit to learn on the basis of the first learning data, the language processing unit generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies in the first learning data, in order to approximate outputted speech of text-to-speech to the inputted speech.

According to an embodiment of the invention, a method of supporting text-to-speech includes the steps of: (a) recognizing inputted speech, and generating first learning data in which wordings of phrases are associated with readings thereof; (b) generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and (c) setting frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording, in order to approximate outputted speech of text-to-speech to the inputted speech.

According to an embodiment of the invention, a program product for allowing an information processing apparatus to function as a system for supporting text-to-speech causes the information processing apparatus to function as: a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof; a frequency data generating unit which generates, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and a setting unit which, in order to approximate outputted speech of text-to-speech to the inputted speech, sets frequency data in a language processing unit for generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording.

According to an embodiment of the invention, an article of manufacture comprises a computer usable medium having computer readable program code means embodied therein for supporting text-to-speech, the computer readable program code means in the article of manufacture including: computer readable program code means for causing a computer to effect recognizing inputted speech and generating first learning data in which wordings of phrases are associated with readings thereof; computer readable program code means for causing a computer to effect generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and computer readable program code means for causing a computer to effect setting frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording, in order to approximate outputted speech of text-to-speech to the inputted speech.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows high level configurations of a supporting system and a text-to-speech processing unit.

FIG. 2 shows an entire configuration of the supporting system.

FIG. 3 shows one example of first learning data.

FIG. 4 shows one example of frequency data.

FIG. 5 shows one example of processing in which data of various kinds are set in the text-to-speech processing unit by the supporting system.

FIG. 6 shows frequencies measured from speech for learning.

FIG. 7 shows confidence parts among the measured frequencies.

FIG. 8 shows specific examples of estimated accent phrases.

FIG. 9 shows one example of text-to-speech processing performed based on frequency data having been set.

FIG. 10 shows one example of a hardware configuration of an information processing apparatus which functions as the supporting system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 shows high level configurations of a supporting system 20 and a text-to-speech processing unit 50. An information system shown in FIG. 1 includes the supporting system 20 and the text-to-speech processing unit 50. Upon acceptance of an input of text, the text-to-speech processing unit 50 generates and outputs speech reading out the text. Specifically, the text-to-speech processing unit 50 includes a language processing unit 52 and a sound processing unit 54. The language processing unit 52 generates a reading corresponding to a wording of inputted text, based on appearance frequencies of readings of the text, and outputs the reading to the sound processing unit 54. A reading means, for example, kinds of pronunciations, pitches of accents, pausing positions of pronunciations, and the like, and does not mean specific waveforms of speech. Then, the sound processing unit 54 generates speech based on this reading which has been inputted thereto, and outputs the speech.
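As a rough illustration of this two-stage split, the following Python sketch models the language processing unit as producing a symbolic reading and the sound processing unit as consuming it. All names, the toy lexicon, and the pitch-value output are hypothetical stand-ins for illustration only, not anything defined by this embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Reading:
    wording: str        # surface form, e.g. "kyouto"
    pronunciation: str  # e.g. "kyo:to" (":" marks a prolonged sound)
    accent: str         # e.g. "LHH", one L/H symbol per mora

def language_processing_unit(text: str) -> List[Reading]:
    # Stand-in: a tiny lookup table instead of the frequency-based search
    # described later; real readings depend on context and frequency data.
    lexicon = {"kyouto": Reading("kyouto", "kyo:to", "LHH"),
               "tawaa": Reading("tawaa", "tawa:", "HHH")}
    return [lexicon[w] for w in text.split() if w in lexicon]

def sound_processing_unit(readings: List[Reading]) -> List[float]:
    # Stand-in for waveform generation: emit one pitch value per mora,
    # high accents as 1.0 and low accents as 0.0.
    return [1.0 if c == "H" else 0.0 for r in readings for c in r.accent]

def text_to_speech(text: str) -> List[float]:
    return sound_processing_unit(language_processing_unit(text))

print(text_to_speech("kyouto tawaa"))  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```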

The supporting system 20 according to this embodiment accepts an input of speech for learning, generates information, such as appearance frequencies of pronunciations and accents of phrases, based on this speech, and sets the information in the text-to-speech processing unit 50. Thereby, the supporting system 20 is intended to efficiently generate high-quality synthesized speech by supporting the text-to-speech processing unit 50.

FIG. 2 shows an entire configuration of the supporting system 20. The supporting system 20 includes a learning data generating unit 200 and a learning unit 250. The learning data generating unit 200 recognizes inputted speech, and generates first learning data 30 in which wordings of phrases are associated with readings thereof. Additionally, the learning data generating unit 200 may generate second learning data in which readings of phrases are associated with waveform data of phonemes, and the like. These sets of learning data thus generated are outputted to the learning unit 250. In order to approximate outputted speech of text-to-speech to this inputted speech, the learning unit 250 causes the language processing unit 52 to learn based on the first learning data 30, and causes the sound processing unit 54 to learn based on the second learning data.

The learning data generating unit 200 includes a speech recognizing unit 210, a fundamental frequency extracting unit 220, an accent phrase estimating unit 230, and an accent determining unit 240. The speech recognizing unit 210 recognizes the inputted speech, generates the first learning data 30 in which wordings of phrases are associated with readings thereof, and outputs the first learning data 30 to the learning unit 250. The speech recognizing unit 210 may receive, as an input, speech reading out previously determined text for learning, and generate the first learning data 30 by associating, with the text for learning, a reading recognized thereby. Additionally, the speech recognizing unit 210 may collectively recognize wordings and readings of phrases by recognizing speech freely pronounced by a speaker, and generate the first learning data 30. In this case, highly accurate recognition becomes difficult in some cases, and therefore, the speech recognizing unit 210 may display a recognition result to a user, and then accept a correcting operation by the user.

The fundamental frequency extracting unit 220 measures frequencies of the inputted speech, and outputs, to the accent phrase estimating unit 230, data in which the frequencies are arranged in chronological order. Based on the data on the frequencies which has been inputted, the accent phrase estimating unit 230 divides the speech into plural accent phrases, and outputs, to the accent determining unit 240, a result of the division along with the data on the frequencies. The accent determining unit 240 classifies an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents, and outputs a group of kinds of accents with respect to each of the accent phrases. Information on these accents is outputted to the learning unit 250 as the first learning data 30, associated with the result of speech recognition by the speech recognizing unit 210. Accordingly, a reading of phrases contained in the first learning data 30 contains not only pronunciations recognized by the speech recognizing unit 210 but also accents of phrases determined by the accent determining unit 240. One example thereof is shown in FIG. 3.

FIG. 3 shows one example of the first learning data 30. In the first learning data 30, phrases recognized in the inputted speech are recorded in the order in which the phrases are pronounced, in a manner that each of the phrases is associated with a wording w of each of the phrases, a word class t thereof, a pronunciation s thereof, and an accent a thereof. For example, the phrases "kyouto (Kyoto)," "tawaa (tower)" and "hoteru (hotel)" are sequentially pronounced in the first half part of the inputted speech, and the word classes recognized are proper noun, general noun and general noun, respectively. In addition, pronunciations of these phrases are "kyo:to," "tawa:" and "hoteru," respectively, where the symbol ":" indicates a prolonged sound. Furthermore, accents of these phrases are "LHH," "HHH" and "HLL," respectively. As one example, the accent of "LHH" indicates that the pronunciation "kyo:to" is pronounced in the order of "low (L)," "high (H)" and "high (H)."
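In code form, the first learning data 30 can be pictured as an ordered sequence of (w, t, s, a) records. The following minimal sketch mirrors the FIG. 3 example; the field names are chosen here for illustration only.

```python
from collections import namedtuple

# One record per recognized phrase, in pronunciation order:
# w = wording, t = word class, s = pronunciation, a = accent pattern.
Phrase = namedtuple("Phrase", ["w", "t", "s", "a"])

first_learning_data = [
    Phrase("kyouto", "proper noun", "kyo:to", "LHH"),
    Phrase("tawaa",  "general noun", "tawa:",  "HHH"),
    Phrase("hoteru", "general noun", "hoteru", "HLL"),
]

for p in first_learning_data:
    print(f"{p.w}\t{p.t}\t{p.s}\t{p.a}")
```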

Thus, the speech recognizing unit 210 determines separations between phrases contained in speech, and judges a wording, a word class and a pronunciation of each of the phrases through speech recognition processing. These pieces of information can be acquired by applying conventional speech recognition technologies. For example, in the conventional speech recognition technologies for converting speech into text, it is often the case that: separations between phrases in the speech and word classes of the respective phrases are judged in the form of internal processing in order to enhance accuracy in the recognition; and the text alone is finally outputted, with results of such internal judgment being excluded from the output. By using the results of such internal judgment in any one of the conventional speech recognition technologies, the speech recognizing unit 210 can judge a wording, a word class and a pronunciation of a phrase.

Note that, in the first learning data 30 in FIG. 3, though the phrase is likewise written as "kyouto," the phrase is recognized in the latter half part of the inputted speech as a phrase having a different accent. Specifically, the phrase is recognized as having an accent of "HLL," which differs from the accent of "LHH" having been recognized in the first half part. Likewise, the phrase written as "tawaa" has different accents in the first half part and in the latter half part. Thus, there is a case where a Japanese phrase is spoken with pronunciations and accents which are different depending on a relation of the phrase with a context or preceding and following phrases, even if the phrase is written in the same wording. For this reason, in text-to-speech, unless a pronunciation is determined in consideration of such differences in context, speech sounding natural to a user cannot be generated in many cases. The supporting system 20 according to this embodiment can output high-quality synthesized speech in consideration of such differences in context and the like.

Referring back to FIG. 2, accents determined by the accent determining unit 240 are outputted to a sound data generating unit 290 as the second learning data, along with information indicating pronunciations recognized by the speech recognizing unit 210, fundamental frequencies extracted by the fundamental frequency extracting unit 220, and separations between accent phrases estimated by the accent phrase estimating unit 230. The learning unit 250 includes an acquisition unit 260, a frequency data generating unit 270, a setting unit 280, and the sound data generating unit 290. The acquisition unit 260 acquires frequency data having been already used by the language processing unit 52. Based on the first learning data 30, the frequency data generating unit 270 generates frequency data indicating wordings of phrases and appearance frequencies of readings thereof. The thus generated frequency data 40 may be generated by synthesizing the frequency data acquired by the acquisition unit 260 and frequency data determined on the basis of the first learning data 30.

The setting unit 280 sets the frequency data 40 in the language processing unit 52 in order to approximate the outputted speech to the inputted speech for learning. From the second learning data, the sound data generating unit 290 generates sound data such as waveform data and rhythm models, based on data such as readings of phrases, phoneme alignments, positions of accent phrases, positions of pauses, and the like. The sound data determines, for example, a duration, a pitch, an intensity, a position of a pause, a length of the pause, and the like of each phoneme. The setting unit 280 sets this sound data in the sound processing unit 54.
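The concrete layout of the sound data is not specified here, but a per-phoneme prosody record of the kind suggested by the sentence above might look like the following illustrative sketch; all field names and values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PhonemeProsody:
    # Per-phoneme values the sound data is said to determine; the concrete
    # representation is not given in the text, so this record is illustrative.
    phoneme: str
    duration_ms: float
    pitch_hz: float
    intensity_db: float
    pause_after_ms: float = 0.0

sound_data = [
    PhonemeProsody("ky", 70.0, 180.0, 62.0),
    PhonemeProsody("o:", 140.0, 210.0, 65.0),
    PhonemeProsody("to", 80.0, 205.0, 63.0, pause_after_ms=120.0),
]
```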

FIG. 4 shows one example of the frequency data 40. In the frequency data 40, appearance frequencies of plural different combinations of readings are associated with each combination of plural phrases continuously written. For example, a wording of "kyouto tawaa" is continuously written, and pronounced in series as "kyo:totawa:" or the like. In addition, though the wording of "kyouto tawaa" is a combination of the phrases "kyouto" and "tawaa," there is also a case where, depending on a performance capability and the like of the speech recognizing unit 210, the wording is recognized as a combination of phrases "kyou" and "totawaa." In the frequency data 40, each combination of phrases which include at least one same wording as a part of the phrase is associated with information indicating separations between phrases, combinations of word classes, and combinations of readings. Furthermore, appearance frequencies of each of these combinations are associated with each of the combinations of phrases.

Specifically, in the frequency data 40, the combination of "kyouto tawaa" in wording is associated with information on separations between phrases, the information indicating that "kyouto" and "tawaa" are different phrases. Furthermore, in the frequency data 40, this combination in wording is associated with a combination of word classes which is "proper noun: general noun." Additionally, in the frequency data 40, a combination of pronunciations which is "kyo:totawa:," and a set of kinds of accents respectively corresponding to accent phrases which is "LHHHHH," are associated therewith. Additionally, in the frequency data 40, an index value of 30 indicating an appearance frequency of the combination is associated therewith. According to an actual example such as "kyouto tawaa hoteru," the wording "kyouto tawaa" is mostly pronounced with accents of "LHHHHH" in a case where another noun is written continuously after "kyouto tawaa." That is, this example indicates that the frequency at which the wording is pronounced in this way, in an example such as "kyouto tawaa hoteru," is the index value of 30.

Additionally, in the frequency data 40, the same combination of "kyouto tawaa" in wording is also associated with information on separations between phrases indicating that "kyouto" and "tawaa" are different phrases, with the combination of word classes "proper noun: general noun," with the combination of pronunciations "kyo:totawa:," and with the set of kinds of accents "LHHHLL" respectively corresponding to accent phrases. Additionally, in the frequency data 40, an index value of 60 indicating an appearance frequency of this combination is associated therewith. According to an actual example such as "kyouto tawaa ni itta (went to Kyoto Tower)," the wording "kyouto tawaa" is mostly pronounced with accents of "LHHHLL" in a case where a phrase that is not a noun is written continuously after "kyouto tawaa." That is, this example indicates that the frequency at which the wording is pronounced in this way, in an example such as "kyouto tawaa ni itta," is the index value of 60.

Additionally, in the frequency data 40, the same combination of "kyouto tawaa" in wording is associated with information on separations between phrases indicating that "kyou" and "totawaa" are different phrases, with the combination of word classes "general noun: general noun," with the combination of pronunciations "kyototawa:," and with the set of kinds of accents "LHLHL." Additionally, in the frequency data 40, an index value of 5 indicating an appearance frequency of this combination is associated therewith. Although such a separation between phrases as in this example is originally inaccurate, a recognition result of this kind can be brought about in some cases depending on precision in speech recognition, an error by a speaker in the speech for learning, and the like. However, the frequency at which this wording is thus recognized is low, and the extremely low index value of 5 is associated with this combination in reading. Therefore, this combination in reading is rarely adopted as a reading generated by the language processing unit 52.
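One possible in-memory picture of the FIG. 4 entries, using the three combinations and index values just described, is the following sketch; the container layout itself is an assumption made for illustration.

```python
# Hypothetical container for the FIG. 4 example: each surface wording maps to
# candidate analyses and the index values used as appearance frequencies.
frequency_data = {
    "kyoutotawaa": [
        {"segmentation": ["kyouto", "tawaa"],
         "word_classes": ["proper noun", "general noun"],
         "pronunciation": "kyo:totawa:", "accent": "LHHHHH", "index": 30},
        {"segmentation": ["kyouto", "tawaa"],
         "word_classes": ["proper noun", "general noun"],
         "pronunciation": "kyo:totawa:", "accent": "LHHHLL", "index": 60},
        {"segmentation": ["kyou", "totawaa"],
         "word_classes": ["general noun", "general noun"],
         "pronunciation": "kyototawa:", "accent": "LHLHL", "index": 5},
    ],
}

# In isolation, the reading with the largest index value is the most likely choice.
best = max(frequency_data["kyoutotawaa"], key=lambda c: c["index"])
print(best["accent"])  # LHHHLL
```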

FIG. 5 shows one example of processing in which various kinds of data are set in the text-to-speech processing unit 50 by the supporting system 20. The speech recognizing unit 210 accepts an input of speech for learning (S500). The speech recognizing unit 210 performs processing of recognition of the inputted speech (S510). By using a result of the speech recognition and the inputted speech, the speech recognizing unit 210, the fundamental frequency extracting unit 220, the accent phrase estimating unit 230, and the accent determining unit 240 generate the first learning data (S520), in which pronunciations, accents and the like are associated with the wording of each phrase. With reference to FIGS. 6 to 8, a specific example of this processing is described.

FIG. 6 shows frequencies measured from the speech for learning. By using a device such as Laryngograph, the fundamental frequency extracting unit 220 detects voiced sound in the form of changes over time in fundamental frequency. Results of detecting frequencies at predetermined measurement intervals are indicated by marks of "x" in FIG. 6.

FIG. 7 shows confidence parts among the measured frequencies. The accent phrase estimating unit 230 excludes, from the measured frequencies, parts judged to be measurement errors, and selects the trustworthy parts. Selected results are indicated by cross marks in FIG. 7.

FIG. 8 shows specific examples of estimated accent phrases. In order to determine accent phrases, first of all, the accent phrase estimating unit 230 reads information on positions of pauses from alignments, and divides data on fundamental frequencies by breath group. Then the accent phrase estimating unit 230 performs the following processing steps, in the following order, for each of the breath groups (a runnable sketch of this recursion is given after the list):

(0) setting all of the cross marks within each breath group as initial values within a range;

(a) finding the mean square of errors when pitch marks within the range are approximated to a triangular model;

(b) if the mean square of errors is under a threshold value, judging the range to be a single triangular model;

(c) if the mean square of errors is not less than the threshold value, dividing the subject range into two ranges at a point minimizing the total of the errors of the two ranges; and

(d) performing the processing for each of the two ranges after going back to processing step (a).
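The following Python sketch shows the accept-or-split recursion of steps (a) to (d). As a simplification, the triangular-model fit of step (a) is replaced here by a least-squares line fit, and the error threshold and sample values are arbitrary assumptions.

```python
from typing import List, Tuple

def fit_error(times: List[float], f0: List[float]) -> float:
    """Mean squared error of a least-squares line fit.

    Stand-in for the triangular-model fit in step (a); a real implementation
    would fit a rise-then-fall (triangular) pitch shape instead of a line.
    """
    n = len(times)
    if n < 2:
        return 0.0
    mt, mf = sum(times) / n, sum(f0) / n
    denom = sum((t - mt) ** 2 for t in times) or 1e-9
    slope = sum((t - mt) * (f - mf) for t, f in zip(times, f0)) / denom
    return sum((f - (mf + slope * (t - mt))) ** 2 for t, f in zip(times, f0)) / n

def split_ranges(times, f0, lo, hi, threshold=25.0) -> List[Tuple[int, int]]:
    """Steps (a)-(d): accept [lo, hi) as one model, or split where the total error is minimal."""
    if fit_error(times[lo:hi], f0[lo:hi]) < threshold or hi - lo <= 3:   # step (b)
        return [(lo, hi)]
    best_cut = min(range(lo + 2, hi - 1),                                # step (c)
                   key=lambda c: fit_error(times[lo:c], f0[lo:c])
                               + fit_error(times[c:hi], f0[c:hi]))
    return split_ranges(times, f0, lo, best_cut, threshold) + \
           split_ranges(times, f0, best_cut, hi, threshold)              # step (d)

# One breath group of measured fundamental frequencies (synthetic values).
t = [i * 0.01 for i in range(12)]
hz = [120, 150, 180, 200, 190, 170, 150, 160, 180, 170, 140, 120]
print(split_ranges(t, hz, 0, len(t)))
```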

Then, the accent phrase estimating unit 230 unites the triangular models together, with respect to each of the breath groups, thereby finding ranges of accent phrases. Basically, each one of the triangular models is a range of an accent phrase. For example, a first accent phrase 800, a second accent phrase 810, and a third accent phrase 820 are each an accent phrase. However, in the following cases, the accent phrase estimating unit 230 unites plural ones of the triangular models together, and judges them to be a single accent phrase. For example, a fourth accent phrase 830 and a fifth accent phrase 840 are united together, and are judged to be a single accent phrase. The three such cases are as follows (a sketch of this merge check is given after the list):

(1) A case where a length of the triangular model is shorter than a predetermined criterial value;

(2) A case where slopes of continuing ones of the triangular models are smaller than a predetermined criterial value; and

(3) A case where a part in which a frequency increases in a successive one of the triangular models is shorter than a predetermined criterial value.
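A minimal sketch of this merge decision for two adjacent models, assuming a simplified triangular-model record and illustrative threshold values, could look as follows.

```python
from dataclasses import dataclass

@dataclass
class Triangle:
    # Simplified triangular model of one pitch movement (assumed representation).
    start: float        # time of first pitch mark (s)
    end: float          # time of last pitch mark (s)
    peak: float         # time of the apex (s)
    rise_slope: float   # Hz per second on the rising side
    fall_slope: float   # Hz per second on the falling side

def should_merge(prev: Triangle, nxt: Triangle,
                 min_len=0.12, min_slope=40.0, min_rise=0.05) -> bool:
    """Cases (1)-(3): unite adjacent models into one accent phrase."""
    too_short = (nxt.end - nxt.start) < min_len                    # case (1)
    too_flat = (abs(prev.fall_slope) < min_slope
                and abs(nxt.rise_slope) < min_slope)               # case (2)
    short_rise = (nxt.peak - nxt.start) < min_rise                 # case (3)
    return too_short or too_flat or short_rise

a = Triangle(0.00, 0.30, 0.10, 220.0, -180.0)
b = Triangle(0.30, 0.38, 0.32, 150.0, -160.0)
print(should_merge(a, b))  # True: the second model is too short
```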

Then, the accent phrase estimating unit 230 judges, as a range of a single accent phrase, a range from a time position of an initial pitch mark to a time position of a final pitch mark within each of the accent phrases. Subsequently, the accent determining unit 240 performs the following processing steps in order to determine the kinds of accents (a sketch of steps (4) to (6) is given after the list):

(1) with respect to each of the accent phrases, normalizing the fundamental frequencies into values not less than 0 and not more than 1;

(2) with respect to each mora (in Japanese, onji) m_(i) within each of the accent phrases, computing a slope g_(i) of each mora m_(i) through least squares approximation;

(3) setting an ending pitch of each mora as e_(i), or, in a case where there is a sudden change at a border between moras, performing different processing;

(4) evaluating whether each mora is UP, DOWN, HIGH, or LOW by using evaluation functions which are defined as follows:

F-LOW(|g_(i)|, e_(i)) = fflat(|g_(i)|) × (1 − e_(i)),

F-HIGH(|g_(i)|, e_(i)) = fflat(|g_(i)|) × e_(i),

F-UP(|g_(i)|, e_(i)) = (1 − fflat(|g_(i)|)) × e_(i), and

F-DOWN(|g_(i)|, e_(i)) = (1 − fflat(|g_(i)|)) × (1 − e_(i)),

where the function "fflat" is a sigmoid function or a function similar thereto;

(5) normalizing the values so that, with respect to each mora, the total of the respective evaluation functions F-LOW, F-HIGH, F-UP and F-DOWN is 1; and

(6) finding, for each mora, the evaluation function among F-LOW, F-HIGH, F-UP and F-DOWN which takes the largest value, and judging the resulting combination over the accent phrase as the kind of accents.
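A small sketch of steps (4) to (6) for a single mora follows; the sigmoid parameters of fflat are assumptions, since the text only requires a sigmoid-like function.

```python
import math

def fflat(g_abs: float, center: float = 30.0, steepness: float = 0.2) -> float:
    """Sigmoid 'flatness' score: near 1 for small |slope|, near 0 for steep slopes.
    Center and steepness are illustrative; the text only calls for a sigmoid-like function."""
    return 1.0 / (1.0 + math.exp(steepness * (g_abs - center)))

def classify_mora(g: float, e: float) -> str:
    """Steps (4)-(6): score LOW/HIGH/UP/DOWN for one mora and pick the largest.
    g is the mora's pitch slope (step 2); e is its normalized ending pitch in [0, 1] (steps 1, 3)."""
    flat = fflat(abs(g))
    scores = {
        "LOW":  flat * (1.0 - e),
        "HIGH": flat * e,
        "UP":   (1.0 - flat) * e,
        "DOWN": (1.0 - flat) * (1.0 - e),
    }
    total = sum(scores.values()) or 1.0          # step (5): normalize to sum to 1
    scores = {k: v / total for k, v in scores.items()}
    return max(scores, key=scores.get)           # step (6)

print(classify_mora(g=5.0,  e=0.9))   # HIGH: flat slope, high ending pitch
print(classify_mora(g=80.0, e=0.1))   # DOWN: steep slope, low ending pitch
```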

By performing the above described processing steps, the accent determining unit 240 can determine kinds of accents with respect to the first to third accent phrases 800 to 820, and to the union of the fourth and fifth accent phrases 830 and 840.

The first learning data 30 is generated by associating the thus recognized accents with the phrases and the pronunciations which have been recognized by the speech recognizing unit 210.

Referring also to FIG. 5, based on the first learning data 30, the frequency data generating unit 270 generates the frequency data indicating wordings of phrases and appearance frequencies of readings thereof (S530). The thus generated frequency data 40 may be newly generated, or may be generated by synthesizing the frequency data acquired by the acquisition unit 260 and the frequency data determined based on the first learning data 30. One example of the processing of generating the frequency data through such synthesis will be described.

First of all, with respect to each combination of plural phrases continuously written, the frequency data generating unit 270 generates a frequency data candidate by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data acquired by the acquisition unit 260, and an appearance frequency at which each combination of readings of the phrases appears in the first learning data 30. Preferably, the frequency data generating unit 270 may generate plural frequency data candidates, with different weights being used in taking the respective weighted averages. For example, if a combination of readings has an index value of 30 in the acquired frequency data and an index value of 60 in the first learning data 30, and the weights are 1:1, then 45 (=(30+60)/2) is computed as the appearance frequency of the combination in the frequency data candidate. Additionally, if the first learning data 30 is given twice the weight of the acquired frequency data, with the other conditions being the same, then 50 (=(30×1+60×2)/3) is computed as the appearance frequency of the combination.
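The weighted average in this step is simple enough to state directly in code; the sketch below reproduces the two numerical examples from the text.

```python
def blended_frequency(acquired: float, learned: float,
                      w_acquired: float, w_learned: float) -> float:
    """Weighted average of an appearance frequency taken from the already-set
    frequency data and from the first learning data."""
    return (acquired * w_acquired + learned * w_learned) / (w_acquired + w_learned)

print(blended_frequency(30, 60, 1, 1))  # 45.0  (weights 1:1)
print(blended_frequency(30, 60, 1, 2))  # 50.0  (first learning data weighted twice)
```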

Next, with respect to each of the frequency data candidates, the frequency data generating unit 270 causes the language processing unit 52 to generate a reading of the wording by using that frequency data candidate, on the basis of the wording of the phrases in the first learning data 30. Then, with respect to each of the frequency data candidates, the frequency data generating unit 270 computes a rate at which the reading thus generated by the language processing unit 52 coincides with the reading in the first learning data 30. For example, this coincidence rate may be evaluated by the number of phrases in the wording whose readings coincide, or by the number of characters in the wording whose readings coincide. The frequency data generating unit 270 judges, as new frequency data, a frequency data candidate whose coincidence rate is not less than a predetermined criterion, and outputs that frequency data candidate to the setting unit 280. Alternatively, the frequency data generating unit 270 may judge, as new frequency data, the frequency data candidate having the highest coincidence rate.
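A hedged sketch of this candidate selection, with the language processing unit replaced by a toy lookup function, might look as follows; the 0.8 criterion and all sample readings are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence

def agreement_rate(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of phrases whose generated reading matches the first learning data."""
    hits = sum(g == r for g, r in zip(generated, reference))
    return hits / max(len(reference), 1)

def pick_candidate(candidates: Dict[str, dict],
                   generate: Callable[[dict, List[str]], List[str]],
                   wordings: List[str],
                   reference: List[str],
                   criterion: float = 0.8) -> str:
    """Keep candidates whose agreement rate meets the criterion and return the best one."""
    scores = {name: agreement_rate(generate(cand, wordings), reference)
              for name, cand in candidates.items()}
    eligible = {n: s for n, s in scores.items() if s >= criterion} or scores
    return max(eligible, key=eligible.get)

# Toy stand-in: each "candidate" is just a lookup table from wording to reading.
toy = {"weights 1:1": {"kyouto": "kyo:to LHH", "tawaa": "tawa: HHH"},
       "weights 1:2": {"kyouto": "kyo:to HLL", "tawaa": "tawa: HHH"}}

def gen(cand, ws):
    return [cand.get(w, "") for w in ws]

print(pick_candidate(toy, gen, ["kyouto", "tawaa"], ["kyo:to LHH", "tawa: HHH"]))
```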

Thus, by updating already-existing frequency data with new learning data, speech synthesis is made possible even in a case where a sufficient amount of learning data is not available, and synthesized speech in which an individuality of the speech for learning is reflected can be outputted.

Subsequently, the setting unit 280 sets the frequency data 40 in the language processing unit 52 (S540). Additionally, in order to generate the second learning data, the accent determining unit 240 determines accents of phrases based on the speech inputted for learning (S550). The thus determined accents form the second learning data, along with information indicating: the pronunciations recognized by the speech recognizing unit 210; the fundamental frequencies extracted by the fundamental frequency extracting unit 220; and the separations between accent phrases estimated by the accent phrase estimating unit 230. Based on data such as readings of phrases, phoneme alignments, positions of accent phrases, positions of pauses, and the like, the sound data generating unit 290 generates sound data such as waveform data and rhythm models (S560). The setting unit 280 sets this sound data in the sound processing unit 54 in order to approximate outputted speech of text-to-speech to the speech recognized for learning (S570).

FIG. 9 shows one example of text-to-speech processing performed based on the frequency data having been set. In order to select a reading and accents based on a wording of text, the language processing unit 52 uses information on appearance frequencies computed with respect to each combination of phrase wordings and with respect to each combination of readings, as will be described in connection with this drawing. Hereinafter, this processing will be specifically described. First of all, the language processing unit 52 acquires text including plural phrases. This text is, for example, set as "yamada kun ha kyouto tawaa . . . ." In this text, separations between phrases of the subjected text are not explicit.

The language processing unit 52 first selects a part corresponding to "yamada kun" within this text as a subjected text piece 900 a, which is a combination of phrase wordings subjected to the processing. The language processing unit 52 retrieves a combination of wordings coinciding with the subjected text piece 900 a from the combinations of phrase wordings contained in the frequency data 40. For example, the language processing unit 52 may retrieve from the frequency data 40 a combination of a phrase 910 a which is "yamada," and a phrase 910 b which is "kun," and further retrieve therefrom a combination of a phrase 910 c which is "yama," and a phrase 910 d which is "dakun."

At this time, in the frequency data 40, the wording "yamada" is associated with an accent continuously and naturally pronounced as "yamada," which may be a popular surname or place name in Japan, whereas the wording "yama" is associated with an accent appropriate for a general noun meaning a mountain and the like. Additionally, although plural combinations of wordings in which borders between phrases are different are shown for the convenience of explanation in the example of this drawing, combinations of wordings in which borders between phrases are the same, with only readings or accents being different, may also be retrieved.

Then, with respect to each of the combinations of readings and accents corresponding to the combination of the retrieved wordings, the language processing unit 52 acquires, from the frequency data 40, appearance frequencies associated with each of the combinations. For example, suppose that, in the learning data, the number of times when the phrases 910 a and 910 b appear in series is nine, and the number of times when the phrases 910 c and 910 d appear in series is one. Then, an index value indicating an appearance frequency which is nine times as large as an appearance frequency of the combination of the phrases 910 c and 910 d is associated with the combination of the phrases 910 a and 910 b in the frequency data 40.

Subsequently, the language processing unit 52 shifts the processing to a next subjected text piece. For example, the language processing unit 52 selects the wording "dakun ha" as a subjected text piece 900 b. The language processing unit 52 retrieves: a combination of the phrase 910 d which is "dakun," and a phrase 910 e which is "wa"; and a combination of the phrase 910 d which is "dakun," and a phrase 910 f which is "ha." Here, the phrases 910 e and 910 f are the same in wording, but are retrieved separately because they have different readings or different accents. The language processing unit 52 computes an appearance frequency at which the phrases 910 d and 910 e appear in series, and an appearance frequency at which the phrases 910 d and 910 f appear in series.

Furthermore, the language processing unit 52 shifts the processing to a next subjected text piece. For example, the language processing unit 52 selects the wording "kun ha" as a subjected text piece 900 c. The language processing unit 52 retrieves: a combination of the phrase 910 b which is "kun," and the phrase 910 e which is "wa"; and a combination of the phrase 910 b which is "kun," and the phrase 910 f which is "ha." The language processing unit 52 acquires an appearance frequency at which the phrases 910 b and 910 e appear in series, and an appearance frequency at which the phrases 910 b and 910 f appear in series.

Thereafter, the language processing unit 52 sequentially selects a subjected text piece 900 d, a subjected text piece 900 e, and a subjected text piece 900 f. Then, with respect to each of the combinations of wordings coinciding with wordings of each of the subjected text pieces, the language processing unit 52 acquires appearance frequencies at which combinations of readings and accents thereof appear. Finally, with respect to each of the paths in each of which combinations of wordings coinciding with a part of the inputted text are sequentially selected, the language processing unit 52 computes a product of appearance frequencies of these combinations of wordings. As one example, with respect to a path in which the phrase 910 a, the phrase 910 b, the phrase 910 e, a phrase 910 g and a phrase 910 h are sequentially selected, the language processing unit 52 computes a product of appearance frequencies of these combinations of wordings.

This computation processing is preferably generalized into the following equation (1):

$M_{u}\left( u_{1}u_{2} \cdots u_{h} \right) = \prod_{i = 1}^{h + 1} P\left( u_{i} \mid u_{i - k} \cdots u_{i - 2}u_{i - 1} \right) \qquad (1)$

In this equation, h indicates the number of combinations of wordings, and is 5 in the example of this drawing. Additionally, k is the number of phrases in the context which are considered retroactively, and k=1 in the example of this drawing because a 2-gram model is supposed therein. Additionally, u=(w, t, s, a), where w, t, s, and a correspond to the respective characters in FIG. 3, and denote a wording, a word class, a pronunciation, and an accent, respectively.

The language processing unit 52 selects a combination of readings and a combination of accents which give the largest one of the appearance frequencies computed with respect to the respective paths. This selection processing is preferably generalized into the following equation (2):

$\hat{u} = \operatorname{argmax}\, M_{u}\left( u_{1}u_{2} \cdots u_{h} \mid x_{1}x_{2} \cdots x_{h} \right) \qquad (2)$

In this equation, x₁x₂ . . . x_(h) indicates the text acquired by the language processing unit 52, and each of x₁ through x_(h) is a character.
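Putting equations (1) and (2) together, the following sketch scores candidate analysis paths of the "yamada kun ha" example with a product of bigram frequencies and picks the best one; the counts, the unseen-pair floor of 0.5, and the accent labels are all illustrative assumptions rather than values from the embodiment.

```python
from math import prod

# Hypothetical bigram counts over units u = (wording, pronunciation, accent);
# the values play the role of the appearance frequencies in equation (1) with k = 1.
BIGRAM = {
    (("yamada", "yamada", "LHHH"), ("kun", "kun", "HL")): 9,
    (("yama", "yama", "LH"), ("dakun", "dakun", "LHH")): 1,
    (("kun", "kun", "HL"), ("ha", "wa", "H")): 6,
    (("kun", "kun", "HL"), ("ha", "ha", "H")): 1,
    (("dakun", "dakun", "LHH"), ("ha", "wa", "H")): 1,
}

def path_score(path):
    """Equation (1), simplified: product of bigram frequencies along one path."""
    return prod(BIGRAM.get((a, b), 0.5) for a, b in zip(path, path[1:]))  # 0.5 = unseen-pair floor

# Candidate analyses of "yamadakunha": two segmentations and two readings of "ha".
candidates = [
    (("yamada", "yamada", "LHHH"), ("kun", "kun", "HL"), ("ha", "wa", "H")),
    (("yamada", "yamada", "LHHH"), ("kun", "kun", "HL"), ("ha", "ha", "H")),
    (("yama", "yama", "LH"), ("dakun", "dakun", "LHH"), ("ha", "wa", "H")),
]

# Equation (2): choose the path with the largest score.
best = max(candidates, key=path_score)
print(best)              # the "yamada / kun / wa" reading wins
print(path_score(best))  # 9 * 6 = 54
```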

According to the above computation and selection processing, based on appearance frequencies of combinations of readings contained in the frequency data 40, the language processing unit 52 can determine a reading of each phrase in the acquired text in consideration of the context thereof.

FIG. 10 shows one example of a hardware configuration of an information processing apparatus 600 which functions as the supporting system 20. The information processing apparatus 600 has: a CPU peripheral section including a CPU 1000, a RAM 1020 and a graphic controller 1075, which are mutually connected by a host controller 1082; an input/output section including a communication interface 1030, a hard disk drive 1040 and a CD-ROM drive 1060, which are connected with the host controller 1082 via an input/output controller 1084; and a legacy input/output section including a ROM 1010, a flexible disk drive 1050 and an input/output chip 1070, which are connected with the input/output controller 1084.

The host controller 1082 connects the RAM 1020 with the CPU 1000 and the graphic controller 1075, which access the RAM 1020 at a high transfer rate. The CPU 1000 operates based on programs stored in the ROM 1010 and the RAM 1020, and controls the respective sections. The graphic controller 1075 obtains image data generated by the CPU 1000 and the like on a frame buffer provided within the RAM 1020, and displays the image data on a display device 1080. Instead of this, the graphic controller 1075 may contain therein a frame buffer for storing image data generated by the CPU 1000 and the like.

The input/output controller 1084 connects the host controller 1082 with the communication interface 1030, the hard disk drive 1040 and the CD-ROM drive 1060, which are relatively high-speed input/output devices. The communication interface 1030 communicates with an external apparatus via a network. The hard disk drive 1040 stores programs and data used by the information processing apparatus 600. The CD-ROM drive 1060 reads out a program or data from a CD-ROM 1095 and supplies it to the RAM 1020 or the hard disk drive 1040.

Additionally, the relatively low-speed input/output devices including the ROM 1010, the flexible disk drive 1050 and the input/output chip 1070 are connected with the input/output controller 1084. The ROM 1010 stores: a boot program executed by the CPU 1000 at the startup of the information processing apparatus 600; programs dependent on the hardware of the information processing apparatus 600; and the like. The flexible disk drive 1050 reads a program or data from a flexible disk 1090 and supplies it to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070. The input/output chip 1070 connects the flexible disk drive 1050 and various input/output devices through, for example, a parallel port, a serial port, a keyboard port and a mouse port.

A program provided to the information processing apparatus 600 is stored in the flexible disk 1090, the CD-ROM 1095 or a recording medium such as an IC card, and is provided by the user. The program is read from the recording medium through at least one of the input/output chip 1070 and the input/output controller 1084, and is installed in the information processing apparatus 600 to be executed. Operations which the program causes the information processing apparatus 600 and the like to execute are the same as those in the supporting system 20 which have been described in connection with FIGS. 1 to 9, and therefore, description thereof will be omitted.

The program described above may be stored in an external recording medium. As the recording medium, any one of an optical recording medium such as a DVD and a PD, a magneto-optic recording medium such as an MD, a tape medium, a semiconductor memory such as an IC card, and the like may be used other than the flexible disk 1090 and the CD-ROM 1095. Additionally, the program may be supplied to the information processing apparatus 600 via the network by using, as the recording medium, a storage device such as a hard disk or a RAM provided in a server system connected with a dedicated communication network or the Internet.

The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

As has been described hereinabove, according to the supporting system 20 of this embodiment, natural synthesized speech based on the speech for learning can be generated more effectively than before. Thereby, various sorts of synthesized speech in which dialects and individualities are reflected can be generated at low cost and within a short time. According to this technology, for example, applications as described below become possible.

APPLICATION EXAMPLE 1

Conventionally, synthesized speech has often been generated in accordance with manners of speech of announcers, and a technical difficulty has been involved in the work of reflecting therein the arrangement of a dialect or an individuality. According to this embodiment, synthesized speech in which an individuality of the speech for learning is reflected can be easily generated. Thereby, for example, by collecting pieces of speech vocalized by a patient with a throat disease before the patient loses voice, and by utilizing them as the speech for learning, synthesized speech in which the patient's own individuality is reflected can be outputted even after the patient loses voice.

APPLICATION EXAMPLE 2

Some corporations may have a voice (a corporate voice) determined, the voice being used for marketing of the corporations, for example, in an automated answering system at a call center, or in a TV/radio commercial message. It has not been possible to use conventional synthesized speech in automated synthesis of such a corporate voice because the conventional synthesized speech simulates a manner of speech of an announcer. According to this embodiment, various corporate voices which differ from corporation to corporation can be efficiently generated.

APPLICATION EXAMPLE 3

Car navigation systems, home videogame machines, or personal robots may in some cases have a function of conversing with a user. For example, they may be enabled to operate in accordance with speech of the user, and to communicate a result of processing to the user by way of speech. According to this embodiment, various sorts of speech can be inexpensively generated, and such speech for conversation can be easily tailored to the user's liking.

While the present invention has been described with reference to a particular preferred embodiment and the accompanying drawings, it will be understood by those skilled in the art that the invention is not limited to the preferred embodiment and that various modifications and the like could be made thereto without departing from the scope of the invention as defined in the following claims.

1. A system for supporting text-to-speech, comprising: a learning datagenerating unit which recognizes inputted speech, and generates firstlearning data in which wordings of phrases are associated with readingsthereof; a frequency data generating unit which generates, on the basisof the first learning data, frequency data indicating appearancefrequencies of both wordings and readings of phrases; a languageprocessing unit; and a setting unit which sets frequency data in thelanguage processing unit for generating, from a wording of text, areading corresponding to the wording, on the basis of appearancefrequencies of readings corresponding to the wording in order toapproximate outputted speech of text-to-speech to the inputted speech.2. The system according to claim 1, further comprising a sound datagenerating unit, wherein: the learning data generating unit recognizesspeech, and generates second learning data in which readings of phrasesare associated with waveform data of phonemes; the sound data generatingunit generates sound data based on wordings of phrases and waveform dataof phonemes in the second learning data; a sound processing unit; andthe setting unit sets the sound data in the sound processing unit whichoutputs speech in accordance with readings of phrases in order toapproximate the outputted speech of text-to-speech to the recognizedspeech.
 3. The system according to claim 1, wherein on the basis of thefirst learning data, the frequency data generating unit generates, asthe frequency data, an appearance frequency at which each ofcombinations of readings corresponding to each of combinations of pluralphrases continuously written appears.
 4. The system according to claim3, further comprising an acquisition unit which acquires frequency datahaving been already set in the language processing unit, wherein: thefrequency data generating unit generates a frequency data candidate foreach of combinations of plural phrases continuously written, by taking aweighted average of an appearance frequency of each combination ofreadings of the phrases in the frequency data having been acquired bythe acquisition unit, and an appearance frequency at which thecombination of readings of the phrases appears in the first learningdata; with respect to the entire readings which are generated by thelanguage processing unit on the basis of a phrase wording in the firstlearning data, and which correspond to each of plural frequency datacandidates generated by using different weights in taking the weightedaverages, the frequency data generating unit computes a ratio ofreadings identical with those in the first learning data to the entirereadings; and the frequency data generating unit generates, as newfrequency data, the frequency data candidate making the identicalreading ratio not less than a predetermined criterion.
 5. The systemaccording to claim 3, wherein: the learning data generating unitrecognizes inputted speech, and generates first learning data in which awording, a reading and a word class of each of phrases are associatedwith one another; and on the basis of the first learning data, thefrequency data generating unit generates, as the frequency data,appearance frequencies of combinations of readings, and appearancefrequencies of combinations of word classes in each of combinations ofplural phrases continuously written.
 6. The system according to claim 1,wherein: the learning data generating unit recognizes inputted speech,and generates first learning data in which a wording, a reading and aword class of each of phrases are associated with one another; and thefrequency data generating unit generates, as the frequency dataindicating appearance frequencies of readings of the phrases, frequencydata indicating appearance frequencies of each of the combinations ofpronunciations and accents of the phrases.
 7. The system according toclaim 6, wherein: the learning data generating unit divides the speechof each phrase into plural accent phrases, classifies an accent of thespeech in each of the accent phrases into any one of predeterminedplural kinds of accents, and generates, as a reading of each phrase, agroup of kinds of accents in each of the accent phrases; and thefrequency data generating unit generates the frequency data by judgingthat an appearance frequency of the group of kinds of accents in each ofthe accent phrases of a phrase in the first learning data is anappearance frequency of a reading of the phrase in the first learningdata.
 8. A system for supporting text-to-speech, comprising: a learningdata generating unit which recognizes inputted speech, and generatesfirst learning data in which wordings of phrases are associated withreadings thereof; a language processing unit; and a learning unit whichcauses the language processing unit to learn on the basis of the firstlearning data, the language processing unit generating, from a wordingof text, a reading corresponding to the wording, on the basis ofappearance frequencies in the first learning data in order toapproximate outputted speech of text-to-speech to the inputted speech.9. The system according to claim 8, further comprising a sound datagenerating unit, wherein: the learning data generating unit recognizesspeech, and generates second learning data in which readings of phrasesare associated with waveform data of phonemes; the sound data generatingunit generates sound data based on wordings of phrases and waveform dataof phonemes in the second learning data; a sound processing unit; and asetting unit sets the sound data in the sound processing unit whichoutputs speech in accordance with readings of phrases in order toapproximate the outputted speech of text-to-speech to the recognizedspeech.
 10. The system according to claim 8, wherein on the basis of thefirst learning data, the frequency data generating unit generates, asthe frequency data, an appearance frequency at which each ofcombinations of readings corresponding to each of combinations of pluralphrases continuously written appears.
 11. The system according to claim10, further comprising an acquisition unit which acquires frequency datahaving been already set in the language processing unit, wherein thefrequency data generating unit generates a frequency data candidate foreach of combinations of plural phrases continuously written, by taking aweighted average of an appearance frequency of each combination ofreadings of the phrases in the frequency data having been acquired bythe acquisition unit, and an appearance frequency at which thecombination of readings of the phrases appears in the first learningdata; with respect to the entire readings which are generated by thelanguage processing unit on the basis of a phrase wording in the firstlearning data, and which correspond to each of plural frequency datacandidates generated by using different weights in taking the weightedaverages, the frequency data generating unit computes a ratio ofreadings identical with those in the first learning data to the entirereadings; and the frequency data generating unit generates, as newfrequency data, the frequency data candidate making the identicalreading ratio not less than a predetermined criterion.
 12. The systemaccording to claim 10, wherein: the learning data generating unitrecognizes inputted speech, and generates first learning data in which awording, a reading, and a word class of each of phrases are associatedwith one another; and on the basis of the first learning data, thefrequency data generating unit generates, as the frequency data,appearance frequencies of combinations of readings, and appearancefrequencies of combinations of word classes in each of combinations ofplural phrases continuously written.
 13. The system according to claim8, wherein: the learning data generating unit recognizes inputtedspeech, and generates first learning data in which a wording, a reading,and a word class of each of phrases are associated with one another; andthe frequency data generating unit generates, as the frequency dataindicating appearance frequencies of readings of the phrases, frequencydata indicating appearance frequencies of each of the combinations ofpronunciations and accents of the phrases.
 14. The system according toclaim 13, wherein: the learning data generating unit divides the speechof each phrase into plural accent phrases, classifies an accent of thespeech in each of the accent phrases into any one of predeterminedplural kinds of accents, and generates, as a reading of each phrase, agroup of kinds of accents in each of the accent phrases; and thefrequency data generating unit generates the frequency data by judgingthat an appearance frequency of the group of kinds of accents in each ofthe accent phrases of a phrase in the first learning data is anappearance frequency of a reading of the phrase in the first learningdata.
 15. A method of supporting text-to-speech, comprising the steps of: recognizing inputted speech, and generating first learning data in which wordings of phrases are associated with readings thereof; generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and setting the frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording, in order to approximate outputted speech of text-to-speech to the inputted speech.
 16. The method according to claim 15, further comprising the steps of: recognizing speech, and generating second learning data in which readings of phrases are associated with waveform data of phonemes; generating sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and setting the sound data in a sound processing unit which outputs speech in accordance with readings of phrases, in order to approximate the outputted speech of text-to-speech to the recognized speech.
 17. The method according to claim 15, further comprising the step of generating, as the frequency data, on the basis of the first learning data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
 18. The method according to claim 17, further comprising the steps of: acquiring frequency data having been already set in the language processing unit; generating a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired in the step of acquiring, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data; with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, computing a ratio of readings identical with those in the first learning data to the entire readings; and generating, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
 19. The method according to claim 17, further comprising the steps of: recognizing inputted speech and generating first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and generating, on the basis of the first learning data, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
 20. The method according to claim 15, further comprising the steps of: recognizing inputted speech and generating first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and generating, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
 21. The method according to claim 20, further comprising the steps of: dividing the speech of each phrase into plural accent phrases; classifying an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents; generating, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and generating the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
 22. A program product for allowing an information processing apparatus to function as a system for supporting text-to-speech, the program product causing the information system to function as: a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof; a frequency data generating unit which generates, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and a setting unit which, in order to approximate outputted speech of text-to-speech to the inputted speech, sets frequency data in a language processing unit for generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording.
 23. The program product according to claim 22, wherein the program product further causes the information system to function as a sound data generating unit, wherein: the learning data generating unit recognizes speech, and generates second learning data in which readings of phrases are associated with waveform data of phonemes; the sound data generating unit generates sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and the setting unit sets the sound data in a sound processing unit which outputs speech in accordance with readings of phrases, in order to approximate the outputted speech of text-to-speech to the recognized speech.
 24. The program product according to claim 22, wherein the program product further causes the information system to function such that, on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
 25. The program product according to claim 24, wherein the program product further causes the information system to function as an acquisition unit which acquires frequency data having been already set in the language processing unit, wherein: the frequency data generating unit generates a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired by the acquisition unit, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data; with respect to readings (hereinafter called the entire readings) which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, the frequency data generating unit computes a ratio of readings identical with those in the first learning data to the entire readings; and the frequency data generating unit generates, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
 26. The program product according to claim 24, wherein the program product further causes the information system to function such that the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
 27. The program product according to claim 22, wherein the program product further causes the information system to function such that the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and the frequency data generating unit generates, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
 28. The program product according to claim 27, wherein the program product further causes the information system to function such that the learning data generating unit divides the speech of each phrase into plural accent phrases, classifies an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents, and generates, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and the frequency data generating unit generates the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
 29. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for supporting text-to-speech, the computer readable program code means in said article of manufacture comprising: computer readable program code means for causing a computer to effect recognizing inputted speech and generating first learning data in which wordings of phrases are associated with readings thereof; computer readable program code means for causing a computer to effect generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases; and computer readable program code means for causing a computer to effect setting frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording, in order to approximate outputted speech of text-to-speech to the inputted speech.
 30. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect: recognizing speech, and generating second learning data in which readings of phrases are associated with waveform data of phonemes; generating sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and setting the sound data in a sound processing unit which outputs speech in accordance with readings of phrases, in order to approximate the outputted speech of text-to-speech to the recognized speech.
 31. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect generating, as the frequency data, on the basis of the first learning data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
 32. The article of manufacture according to claim 31, further comprising computer readable program code means for causing a computer to effect: acquiring frequency data having been already set in the language processing unit; generating a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired in the step of acquiring, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data; with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, computing a ratio of readings identical with those in the first learning data to the entire readings; and generating, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
 33. The article of manufacture according to claim 31, further comprising computer readable program code means for causing a computer to effect: recognizing inputted speech and generating first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and on the basis of the first learning data, generating, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
 34. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect: recognizing inputted speech and generating first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and generating, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
 35. The article of manufacture according to claim 34, further comprising computer readable program code means for causing a computer to effect: dividing the speech of each phrase into plural accent phrases; classifying an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents; generating, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and generating the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.